The goal of this project is to investigate what causes Serious and Fatal accidents, in the hope of preventing them and decreasing their number. The dataset consists of accident records from the UK spanning more than 15 years. I hope to show the causes of these accidents through visualizations and to create an algorithm that can predict the severity of accidents.
The UK government collects and publishes (usually on an annual basis) detailed information about traffic accidents across the country. This information includes, but is not limited to, geographical locations, weather conditions, types of vehicles, numbers of casualties, and vehicle manoeuvres, making this a very interesting and comprehensive dataset for analysis and research.
The data that I'm using is compiled and available through Kaggle and, in a less compiled form, here.
Genesis L. Taylor
Github | Linkedin | Tableau | genesisltaylor@gmail.com
Problem: Traffic Accidents
Solution Method: Use data to figure out how to lower both the number of accidents and their severity.
UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation
UK Road Safety: Traffic Accidents and Vehicles Introduction, Data Cleaning, and Feature Manipulation: Github Link
UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution
UK Road Safety: Traffic Accidents and Vehicles Visualizations and Solution: Github Link
UK Road Safety: Traffic Accidents and Vehicles Machine Learning
UK Road Safety: Traffic Accidents and Vehicles Machine Learning: Github Link
Traffic Analysis and Severity Prediction Powerpoint Presentation
Traffic Analysis and Severity Prediction Powerpoint Presentation: Github Link
#Import modules
import numpy as np
import holidays
import pandas as pd
import seaborn as sns
import pickle
import time
import timeit
import matplotlib.pyplot as plt
plt.style.use('dark_background')
%matplotlib inline
import datetime
import math
from collections import Counter
#scipy
from scipy import stats
from scipy.stats import chi2_contingency
#sklearn
import sklearn
from sklearn import ensemble
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, OrdinalEncoder
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample
#for clustering (PCA already imported above)
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.preprocessing import normalize
from sklearn.metrics import silhouette_score
#other learners
from xgboost import XGBClassifier
import lightgbm as lgb
from kmodes.kmodes import KModes
#imblearn
from imblearn.ensemble import BalancedBaggingClassifier
from imblearn.ensemble import EasyEnsembleClassifier
from imblearn.ensemble import BalancedRandomForestClassifier
#webscraping
import requests
from bs4 import BeautifulSoup
import re
import urllib
from IPython.core.display import HTML
#time series
import statsmodels.api as sm
from pylab import rcParams
import itertools
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima_model import ARIMA
#warning ignorer
import warnings
warnings.filterwarnings("ignore")
# # #DATAFRAME PICKLE CREATED IN CELLS BELOW INSTEAD OF RUNNING THROUGH ENTIRE PROCESS AFTER RESTARTING
# # #import pickled file
# df = pd.read_pickle("df.pkl")
# df.to_csv('uktraffic_acc.csv')
#import files
ac = pd.read_csv(r'Accident_Information.csv', low_memory=False, chunksize=30000)
vc = pd.read_csv(r'Vehicle_Information.csv', low_memory=False, chunksize=30000)
Previously, I did not remove "Data missing or out of range" from the datasets; after cleaning and checking the value counts, I decided to do so for sanity's sake. In most columns, only a small percentage of the values carried that label.
#chunk cleaning and dataframing for the accident data
acchunk = []
for chunk in ac:
    acchunk_filter = chunk[
        (chunk.Year.astype(int) >= 2010) &
        (chunk.Year.astype(int) <= 2017) &
        (chunk['Road_Type'] != "Unknown") &
        (chunk['Junction_Control'] != "Data missing or out of range") &
        (chunk['Carriageway_Hazards'] != "Data missing or out of range") &
        (chunk['Junction_Detail'] != "Data missing or out of range") &
        (chunk['Road_Surface_Conditions'] != "Data missing or out of range") &
        (chunk['Special_Conditions_at_Site'] != "Data missing or out of range") &
        (chunk['Weather_Conditions'] != "Data missing or out of range") &
        (chunk['Latitude'].notnull()) &
        (chunk['Longitude'].notnull())
    ]
    acchunk.append(acchunk_filter)
df1 = pd.concat(acchunk)
#chunk cleaning for the vehicle data
vcchunk = []
for chunk2 in vc:
    vcchunk_filter = chunk2[
        (chunk2.Year.astype(int) >= 2010) &
        (chunk2.Year.astype(int) <= 2017) &
        (chunk2['Driver_Home_Area_Type'] != "Data missing or out of range") &
        (chunk2['Journey_Purpose_of_Driver'] != "Data missing or out of range") &
        (chunk2['Junction_Location'] != "Data missing or out of range") &
        (chunk2['Was_Vehicle_Left_Hand_Drive'] != "Data missing or out of range") &
        (chunk2['Hit_Object_in_Carriageway'] != "Data missing or out of range") &
        (chunk2['Skidding_and_Overturning'] != "Data missing or out of range") &
        (chunk2['Towing_and_Articulation'] != "Data missing or out of range") &
        (chunk2['Vehicle_Leaving_Carriageway'] != "Data missing or out of range") &
        (chunk2['Vehicle_Manoeuvre'] != "Data missing or out of range") &
        (chunk2['Vehicle_Type'] != "Data missing or out of range") &
        (chunk2['X1st_Point_of_Impact'] != "Data missing or out of range") &
        (chunk2['Sex_of_Driver'] != "Data missing or out of range") &
        (chunk2['Age_Band_of_Driver'] != "Data missing or out of range")
    ]
    vcchunk.append(vcchunk_filter)
df2 = pd.concat(vcchunk)
#check columns
print("Accident's Columns:\n",df1.columns, "\n")
print("Vehicle's Columns:\n",df2.columns)
print('Accident Shape', df1.shape)
print('Vehicle Shape',df2.shape)
#merge dataframes
df = pd.merge(df1,df2)
#check columns
print("Names of Combined Columns:\n",df.columns, "\n")
print("\nShape:\n",df.shape)
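Since no `on=` argument is passed, pd.merge joins the two frames on every column name they share and keeps only the matching rows (an inner join). A minimal sketch of that default, with hypothetical mini-frames:

```python
import pandas as pd

#hypothetical accident- and vehicle-level frames sharing "Accident_Index"
acc = pd.DataFrame({"Accident_Index": ["A1", "A2"],
                    "Speed_Limit": [30, 60]})
veh = pd.DataFrame({"Accident_Index": ["A1", "A1", "A2"],
                    "Vehicle_Type": ["Car", "Bus", "Car"]})

#with no on=, merge uses all shared columns and an inner join,
#so each accident row repeats once per matching vehicle row
merged = pd.merge(acc, veh)
print(merged.shape)  # (3, 3)
```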
df.describe(include ='all')
#check corr b/t Location_Easting_OSGR & Location_Northing_OSGR AND Longitude and Latitude
print(df['Location_Easting_OSGR'].corr(df['Longitude']))
print(df['Location_Northing_OSGR'].corr(df['Latitude']))
#drop Location_Easting_OSGR & Location_Northing_OSGR
#because they are essentially duplicates of Latitude and Longitude
df = df.drop(['Location_Easting_OSGR', 'Location_Northing_OSGR'], axis=1)
df.shape
#standardize all column names to lowercase, and remove some characters
#for ease of use in querying
df.columns = map(str.lower, df.columns)
#regex=False treats the patterns as literal characters, not regular expressions
df.columns = df.columns.str.replace('.', '', regex=False)
df.columns = df.columns.str.replace('(', '', regex=False)
df.columns = df.columns.str.replace(')', '', regex=False)
#convert date/time to datetime datatype
df['date'] = pd.to_datetime((df['date']), format= "%Y-%m-%d")
#df.dtypes
#mistyped datatypes: treat these code-valued columns as objects
catcols = ['did_police_officer_attend_scene_of_accident', 'driver_imd_decile',
           'vehicle_reference', 'vehicle_locationrestricted_lane',
           '1st_road_number', '2nd_road_number',
           'pedestrian_crossing-physical_facilities',
           'pedestrian_crossing-human_control']
df[catcols] = df[catcols].astype('object')
df.columns.to_series().groupby(df.dtypes).groups
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
# #2nd_road_class
df['2nd_road_class'].value_counts()/df.shape[0]*100
With 40% of the non-null values being unclassified and 39% of the overall 2nd_road_class column being null, I have decided to drop it entirely.
df = df.drop(['2nd_road_class'], axis=1)
#driver_imd_decile
df['driver_imd_decile'].value_counts()/df.shape[0]*100
Since the distribution of categories for 'driver_imd_decile' looks fairly even, I've decided not to use the mode but method='ffill'.
df['driver_imd_decile'].fillna(method='ffill', inplace=True)
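For illustration, forward fill (`fillna(method='ffill')`, equivalently `Series.ffill`) copies the most recent non-null value down into each null, so on a fairly even column it roughly preserves the existing distribution (toy values assumed):

```python
import pandas as pd
import numpy as np

s = pd.Series([3.0, np.nan, np.nan, 7.0, np.nan])
filled = s.ffill()  # each NaN takes the previous non-null value
print(filled.tolist())  # [3.0, 3.0, 3.0, 7.0, 7.0]
```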
df['age_of_vehicle'].describe()
df['age_of_vehicle'].median()
Changing the nulls of "age_of_vehicle" to the median (7), then binning it into categories.
#fillna by 7
df['age_of_vehicle'].fillna(7, inplace=True)
#group age_of_vehicle
#bins: 1=(0-2], 2=(2-5], 3=(5-8], 4=(8-11], 5=(11-14], 6=(14-17], 7=(17-120]
def fixedvehicleage(age):
    #treat impossible ages as missing
    if 0 <= age <= 120:
        return age
    return np.nan

df['age_of_vehicle'] = df['age_of_vehicle'].apply(fixedvehicleage)
df['age_of_vehicle'] = pd.cut(df['age_of_vehicle'],
                              [0, 2, 5, 8, 11, 14, 17, 120],
                              labels=['1', '2', '3', '4', '5', '6', '7'])
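To make the binning concrete: with these edges, `pd.cut` uses right-closed intervals, so a vehicle age of exactly 2 lands in band '1' and an age of 3 in band '2' (toy ages assumed; note a value of exactly 0 falls outside (0, 2] and becomes NaN):

```python
import pandas as pd

ages = pd.Series([1, 2, 3, 9, 20])
bands = pd.cut(ages, [0, 2, 5, 8, 11, 14, 17, 120],
               labels=['1', '2', '3', '4', '5', '6', '7'])
print(bands.tolist())  # ['1', '1', '2', '4', '7']
```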
#model
df['model'].value_counts()/df.shape[0]*100
df['model'].describe()
Knowing that there are 28824 unique models in the model column, I have decided to use the ffill method on it as well.
df['model'].fillna(method='ffill', inplace=True)
Note: A lot of the values of "model" are labeled as "missing". I do not want to change these, because the model badge could actually have been missing from the car or unrecognizable at the time of the accident.
#engine_capacity_cc
df['engine_capacity_cc'].describe()
I am going to handle both outliers and the null values of engine_capacity_cc using the ideals of quantiles and the interquartile range (IQR).
#first I'm going to handle both ends of outliers
#(determine the min and max cutoffs for detecting outliers)
q75, q25 = np.percentile(df['engine_capacity_cc'].dropna(), [75 ,25])
iqr = q75 - q25
ecmin = q25 - (iqr*1.5)
ecmax = q75 + (iqr*1.5)
print(ecmax)
print(ecmin)
To explain: I will use ecmax as the maximum allowed engine_capacity_cc and ecmin as the minimum. Then I'm going to take the mean of the remaining values and use it (rounded) as my fillna value.
df = df[df['engine_capacity_cc']<=ecmax]
df = df[df['engine_capacity_cc']>=ecmin]
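One side effect worth noting: comparisons against NaN evaluate to False, so these two filters also drop every row whose engine_capacity_cc is null, not just the outliers. A small sketch of that behaviour (toy values assumed):

```python
import pandas as pd
import numpy as np

df_toy = pd.DataFrame({"engine_capacity_cc": [1000.0, np.nan, 9000.0]})
kept = df_toy[df_toy["engine_capacity_cc"] <= 2000]  # NaN <= 2000 is False
print(len(kept))  # 1 -- the null row is dropped along with the outlier
```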
df['engine_capacity_cc'].hist(bins=20)
plt.style.use('dark_background')
I can accept this distribution and will now check and handle the remaining nulls.
#check values of 'engine_capacity_cc'
df['engine_capacity_cc'].describe()
df['engine_capacity_cc'].mean()
Going to round this mean value
df['engine_capacity_cc'].fillna(1652, inplace=True)
Note: After doing the above fixes, propulsion_code dropped from having 10% null values to 0 (see below). I will continue on and fix lsoa_of_accident_location, then drop the rest of the null values, which are all <5%.
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
# #lsoa_of_accident_location
df['lsoa_of_accident_location'].value_counts()
df['lsoa_of_accident_location'].describe()
With 35061 unique values and high counts among the top values, I am deciding to do ffill again.
df['lsoa_of_accident_location'].fillna(method='ffill', inplace=True)
#### Check nulls again
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
Dropping the remaining nulls that are <1%.
#drop the remaining nulls that are <1%
df.dropna(inplace=True)
#last check
df.isnull().sum().sort_values(ascending=False)/df.shape[0]*100
df.shape
df.info()
#detecting outliers of numerical columns (all floats/ints excluding lat/long and year)
df_num = df[['engine_capacity_cc','number_of_casualties','number_of_vehicles','speed_limit']]
df_num.hist( bins=25, grid=False, figsize=(12,8))
plt.style.use('dark_background')
Column 'speed_limit' seems fine, and 'engine_capacity_cc' was already handled above. However, 'number_of_casualties' and 'number_of_vehicles' will be evaluated.
# #number_of_casualties
df['number_of_casualties'].value_counts()
#create casualties grouping
def casualties(num_cas):
    if 1 <= num_cas < 2:
        return "1"
    elif 2 <= num_cas < 3:
        return "2"
    elif 3 <= num_cas < 4:
        return "3"
    elif 4 <= num_cas < 5:
        return "4"
    elif num_cas >= 5:
        return "5+"

#apply function
df['number_of_casualties'] = df['number_of_casualties'].apply(casualties)
#number_of_casualties
df['number_of_casualties'].value_counts()
df['propulsion_code'].value_counts()/df.shape[0]*100
#Clean the values for Propulsion Code with a single mapping.
df['propulsion_code'] = df['propulsion_code'].replace(
    {"Gas": "Petrol",
     "Gas/Bi-fuel": "Bio-fuel",
     "Petrol/Gas (LPG)": "LPG Petrol",
     "Gas Diesel": "Diesel"})
df['propulsion_code'].value_counts()/df.shape[0]*100
# #unique values
df.nunique().sort_values(ascending=False)
#date is already a datetime column; extract the month
df['month'] = df['date'].dt.month
#creating a weekend feature that includes Friday-Sunday
df['weekend']= np.where(df['day_of_week'].isin(['Friday', 'Saturday', 'Sunday']), 1, 0)
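The same isin/np.where pattern on toy day names, showing the resulting 0/1 flag:

```python
import numpy as np
import pandas as pd

days = pd.Series(["Monday", "Friday", "Sunday"])
weekend = np.where(days.isin(["Friday", "Saturday", "Sunday"]), 1, 0)
print(weekend.tolist())  # [0, 1, 1]
```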
#create time of day feature with Morning Rush, Day, Noon Rush, Afternoon, After Work Rush, Night
#time of day dictionary
timeofdaygroups = {1: "Morning Rush (6-10)",
                   2: "Day (10-12)",
                   3: "Lunch Rush (12-14)",
                   4: "Afternoon (14-16)",
                   5: "After Work Rush (16-18)",
                   6: "Evening (18-22)",
                   7: "Night (22-6)"}
#pull the first two characters of the time string and convert to an integer hour
df['hour'] = df['time'].str[0:2].astype(int)
#create time_of_day grouping
def daygroup(hour):
    if 6 <= hour < 10:
        return "1"
    elif 10 <= hour < 12:
        return "2"
    elif 12 <= hour < 14:
        return "3"
    elif 14 <= hour < 16:
        return "4"
    elif 16 <= hour < 18:
        return "5"
    elif 18 <= hour < 22:
        return "6"
    else:
        return "7"

#apply time of day function
df['time_of_day'] = df['hour'].apply(daygroup)
df[['weekend','day_of_week','time', 'time_of_day']].tail(10)
#vehicle_type
df['vehicle_type'].value_counts()/df.shape[0]*100
I want to condense the vehicle type variables.
#condense vehicle_type categories with a single mapping
vehicle_map = {
    #motorcycles
    "Motorcycle over 500cc": "Motorcycle",
    "Motorcycle over 125cc and up to 500cc": "Motorcycle",
    "Motorcycle 125cc and under": "Motorcycle",
    "Motorcycle 50cc and under": "Motorcycle",
    "Electric motorcycle": "Motorcycle",
    "Motorcycle - unknown cc": "Motorcycle",
    #goods vehicles
    "Van / Goods 3.5 tonnes mgw or under": "Goods Vehicle",
    "Goods over 3.5t. and under 7.5t": "Goods Vehicle",
    "Goods vehicle - unknown weight": "Goods Vehicle",
    "Goods 7.5 tonnes mgw and over": "Goods Vehicle",
    #car
    "Taxi/Private hire car": "Car",
    #bus
    "Minibus (8 - 16 passenger seats)": "Bus",
    "Bus or coach (17 or more pass seats)": "Bus",
    #other vehicle
    "Agricultural vehicle": "Other Vehicle",
    "Other vehicle": "Other Vehicle"}
df['vehicle_type'] = df['vehicle_type'].replace(vehicle_map)
#vehicle_type
df['vehicle_type'].value_counts()/df.shape[0]*100
Create more condensed groups for age band of driver in order to deal with some potential outliers.
#age_band_of_driver
df['age_band_of_driver'].value_counts()/df.shape[0]*100
#done beforehand because "Over 75" wouldn't convert in the code below
df['age_band_of_driver']=df['age_band_of_driver'].replace("Over 75","75-100")
#map the raw bands into condensed groups in one pass
age_band_map = {"0 - 5": "Under 16", "6 - 10": "Under 16", "11 - 15": "Under 16",
                "16 - 20": "16-25", "21 - 25": "16-25",
                "26 - 35": "26-45", "36 - 45": "26-45",
                "46 - 55": "46-65", "56 - 65": "46-65",
                "66 - 75": "Over 65", "75-100": "Over 65"}
df['age_band_of_driver'] = df['age_band_of_driver'].replace(age_band_map)
#age_band_of_driver
print("Distinct responses for age_band_of_driver:\n", set(df['age_band_of_driver']))
# number_of_vehicles
df['number_of_vehicles'].value_counts()/df.shape[0]*100
#group number_of_vehicles
def vehicles(num_veh):
    if 1 <= num_veh < 2:
        return "1"
    elif 2 <= num_veh < 3:
        return "2"
    elif 3 <= num_veh < 4:
        return "3"
    elif num_veh >= 4:
        return "4+"
#apply function
df['number_of_vehicles']= df['number_of_vehicles'].apply(vehicles)
# number_of_vehicles
df['number_of_vehicles'].value_counts()/df.shape[0]*100
df['number_of_vehicles'].dtypes
df['number_of_vehicles']=df['number_of_vehicles'].astype('object')
#create season column for ML
def getSeason(month):
    if month in (12, 1, 2):
        return "winter"
    elif month in (3, 4, 5):
        return "spring"
    elif month in (6, 7, 8):
        return "summer"
    else:
        return "fall"
df['season'] = df['month'].apply(getSeason)
#season
df['season'].value_counts()/df.shape[0]*100
#go back to engine capacity cc and create size groups
df.engine_capacity_cc.hist()
def enginecap(eng_cc):
    if eng_cc <= 1500:
        return "small engine cc"
    elif eng_cc <= 2000:
        return "medium engine cc"
    else:
        return "large engine cc"
df['engine_capacity_cc_size'] = df['engine_capacity_cc'].apply(enginecap)
df.engine_capacity_cc_size.value_counts()
#Put above pickle in next full run
#create new column for Machine Learning and Visualization with Not Serious and Serious
#(Serious stays Serious; Fatal is folded into Serious)
df['accident_seriousness'] = df['accident_severity'].replace(
    {"Slight": "Not Serious", "Fatal": "Serious"})
df.shape
df.accident_seriousness.value_counts()
#pickling everything to speed up restarting
df.to_pickle("df.pkl")
#import pickled file
df = pd.read_pickle("df.pkl")
df.head()
accidentsperyear = df.groupby(['year'])['accident_index'].count()
# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(10,5))
colors = sns.color_palette("plasma", n_colors=7)
sns.barplot(accidentsperyear.index,accidentsperyear.values, palette=colors)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Year",fontsize=20,fontweight="bold")
plt.xlabel("\nYear", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.savefig('accidentsperyear.png')
plt.tight_layout()
accidentspermonth = df.groupby(['month'])['accident_index'].count()
# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(20,10))
colors = sns.color_palette("plasma_r", n_colors=12)
mt=sns.barplot(accidentspermonth.index,accidentspermonth.values, palette=colors)
sns.despine(top=True, right=True, left=True, bottom=True)
#ax is the axes instance
group_labels = ['Jan', 'Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec' ]
mt.set_xticklabels(group_labels)
plt.title("Accidents Per Month",fontsize=20,fontweight="bold")
plt.xticks(fontsize=18)
plt.yticks(fontsize=12)
plt.xlabel("\nMonth", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.savefig('accidentspermonth.png')
plt.tight_layout()
weekdays = ['Monday', 'Tuesday','Wednesday','Thursday', 'Friday', 'Saturday', 'Sunday']
accweekday = df.groupby(['year', 'day_of_week']).size()
accweekday = accweekday.rename_axis(['year', 'day_of_week'])\
.unstack('day_of_week')\
.reindex(columns=weekdays)
plt.figure(figsize=(15,10))
plt.style.use('dark_background')
sns.heatmap(accweekday, cmap='plasma_r')
plt.title('\nAccidents by Weekday per Year\n', fontsize=14, fontweight='bold')
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)
plt.xlabel('')
plt.ylabel('')
plt.savefig('accidentsbyweekdayperyear.png')
plt.show()
Fridays are the day of the week where the most accidents occur.
accidentsperseason = df.groupby(['season'])['accident_index'].count()
seaord=['spring', 'summer', 'fall','winter']
# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(15,10))
sns.barplot(accidentsperseason.index,accidentsperseason.values, order=seaord,
saturation=1, palette='magma_r')
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Season",fontsize=20,fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)
plt.xlabel("\nSeason", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.savefig('accidentsperseason.png')
#"Morning Rush (6-10)", "Day (10-12)", "Lunch Rush (12-14)","Afternoon (14-16)",
#"After Work Rush (16-18)", "Evening (18-22)", "Night (22-6)"
timeofdaygroups = {'1': "Morning Rush",
                   '2': "Day",
                   '3': "Lunch Rush",
                   '4': "Afternoon",
                   '5': "After Work Rush",
                   '6': "Evening",
                   '7': "Night"}
df['time_of_day']=df['time_of_day'].map(timeofdaygroups)
accidentspertod = df.groupby(['time_of_day'])['accident_index'].count()
# prepare plot
plt.style.use('dark_background')
plt.figure(figsize=(15,10))
tod=["Morning Rush", "Day", "Lunch Rush", "Afternoon",
"After Work Rush", "Evening", "Night"]
sns.barplot(accidentspertod.index,accidentspertod.values, order=tod, palette='rainbow')
sns.despine(top=True, right=True, left=True, bottom=True)
plt.title("Accidents Per Time of Day",fontsize=20,fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=12)
plt.xlabel("", fontsize=15, fontweight="bold")
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.tight_layout()
plt.savefig('accidentspertod.png')
%%HTML
<div class='tableauPlaceholder' id='viz1572176706313' style='position: relative'><noscript><a href='https://github.com/GenTaylor/Traffic-Accident-Analysis'><img alt=' ' src='https://public.tableau.com/static/images/Ac/AccidentForecasting/AccidentForecasting/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AccidentForecasting/AccidentForecasting' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ac/AccidentForecasting/AccidentForecasting/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1572176706313'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
According to the forecasting above, traffic accidents will be slightly lower than in previous years, while following similar monthly trends.
For correlation I used both Pearson and Spearman, in case there were discrepancies. The ordering varied slightly, but the most highly correlated features remained the same.
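The difference between the two: Pearson measures linear association, while Spearman correlates the ranks, so a monotonic but non-linear relationship scores perfectly under Spearman and lower under Pearson. A quick sketch:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = x ** 3  # monotonic but non-linear

pearson = x.corr(y)                      # below 1: the relationship is not linear
spearman = x.corr(y, method="spearman")  # 1.0: the ranks agree perfectly
print(round(pearson, 3), round(spearman, 3))
```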
#correlation by accident severity pearson
corrdf=df.apply(LabelEncoder().fit_transform)
sc = StandardScaler()
corrdf = sc.fit_transform(corrdf)
corrdf=pd.DataFrame(data=corrdf,columns=df.columns)
corr = corrdf.corr()['accident_seriousness']
corr.sort_values(ascending=False)
corr_spear = corrdf.corr(method='spearman')['accident_seriousness']
corr_spear.sort_values(ascending=False)
Looking at this, I wanted to visualize some of the stronger positive/negative correlations with accident seriousness.
Before making those visualizations, I wanted to be sure that the chosen features were actually of some importance to accident_seriousness. For this, the chi-squared test of independence was used.
"""chisquare algorithm from
http://www.insightsbot.com/blog/2AeuRL/chi-square-feature-selection-in-python """
class ChiSquare:
    def __init__(self, dataframe):
        self.df = dataframe
        self.p = None        #p-value
        self.chi2 = None     #chi-squared test statistic
        self.dof = None      #degrees of freedom
        self.dfObserved = None
        self.dfExpected = None

    def _print_chisquare_result(self, colX, alpha):
        if self.p < alpha:
            result = "The column {0} is IMPORTANT for Prediction".format(colX)
        else:
            result = "The column {0} is NOT an important predictor. (Discard {0} from model)".format(colX)
        print(result)

    def TestIndependence(self, colX, colY, alpha=0.05):
        X = self.df[colX].astype(str)
        Y = self.df[colY].astype(str)
        self.dfObserved = pd.crosstab(Y, X)
        chi2, p, dof, expected = stats.chi2_contingency(self.dfObserved.values)
        self.p = p
        self.chi2 = chi2
        self.dof = dof
        self.dfExpected = pd.DataFrame(expected, columns=self.dfObserved.columns,
                                       index=self.dfObserved.index)
        self._print_chisquare_result(colX, alpha)
#Initialize ChiSquare Class
cT = ChiSquare(df)
#Feature Selection
testColumns = ['accident_index', '1st_road_class', '1st_road_number','2nd_road_number',
'carriageway_hazards', 'date', 'day_of_week',
'did_police_officer_attend_scene_of_accident','junction_control',
'junction_detail', 'latitude', 'light_conditions', 'local_authority_district',
'local_authority_highway', 'longitude','lsoa_of_accident_location',
'number_of_casualties', 'number_of_vehicles', 'pedestrian_crossing-human_control',
'pedestrian_crossing-physical_facilities', 'police_force','road_surface_conditions',
'road_type', 'special_conditions_at_site', 'speed_limit', 'time',
'urban_or_rural_area', 'weather_conditions', 'year', 'inscotland',
'age_band_of_driver', 'age_of_vehicle', 'driver_home_area_type',
'driver_imd_decile', 'engine_capacity_cc','hit_object_in_carriageway',
'hit_object_off_carriageway', 'journey_purpose_of_driver', 'junction_location',
'make', 'model','propulsion_code', 'sex_of_driver', 'skidding_and_overturning',
'towing_and_articulation', 'vehicle_leaving_carriageway',
'vehicle_locationrestricted_lane', 'vehicle_manoeuvre','vehicle_reference',
'vehicle_type', 'was_vehicle_left_hand_drive', 'x1st_point_of_impact', 'month',
'weekend', 'hour', 'time_of_day','season', 'engine_capacity_cc_size']
for var in testColumns:
    cT.TestIndependence(colX=var, colY="accident_seriousness")
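At its core, each TestIndependence call reduces to scipy's chi2_contingency on the observed crosstab; a self-contained toy run with made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

#hypothetical 2x2 table: seriousness (rows) x weekend flag (columns)
observed = np.array([[400, 100],
                     [300, 200]])
chi2, p, dof, expected = chi2_contingency(observed)
print(dof)       # 1 degree of freedom for a 2x2 table
print(p < 0.05)  # True here, so the two variables are not independent
```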
For my visualizations I have decided to use some of the features with the highest correlations to accident_seriousness:
Note: The columns used were selected because of the absolute value of their correlation in relation to accident_seriousness
*Some columns were added after the correlation was computed, following undersampling.
For visual reasons, two separate dataframes were created: one for not-serious and one for serious accidents. I wanted to better scale the data, and for me this was the simplest way of doing so.
#dataframe where accidents are Slight
not_serious = df[(df['accident_seriousness']=="Not Serious")]
print("Not Serious Group Shape:", not_serious.shape)
not_serious.accident_seriousness.value_counts()
#dataframe where accidents are serious
serious= df[(df['accident_seriousness']=="Serious")]
print("Serious Group Shape:", serious.shape)
serious.accident_seriousness.value_counts()
#map 1, 2, 3 in did_police_officer_attend_scene_of_accident with Yes, No,Self-reported
policeattend = {1: "Yes", 2:"No", 3:"Self-Reported"}
not_serious['did_police_officer_attend_scene_of_accident']=not_serious['did_police_officer_attend_scene_of_accident'].map(policeattend)
df['did_police_officer_attend_scene_of_accident']=df['did_police_officer_attend_scene_of_accident'].map(policeattend)
serious['did_police_officer_attend_scene_of_accident']=serious['did_police_officer_attend_scene_of_accident'].map(policeattend)
imddecile = {1:"Most deprived 10%", 2:"More deprived 10-20%", 3:"More deprived 20-30%",
4:"More deprived 30-40%", 5:"More deprived 40-50%", 6:"Less deprived 40-50%",
7:"Less deprived 30-40%", 8:"Less deprived 20-30%", 9:"Less deprived 10-20%",
10:"Least deprived 10%"}
not_serious['driver_imd_decile']=not_serious['driver_imd_decile'].map(imddecile)
df['driver_imd_decile']=df['driver_imd_decile'].map(imddecile)
serious['driver_imd_decile']=serious['driver_imd_decile'].map(imddecile)
#setups for adding frequencies to visualizations
dftotal= float(len(df))
nstotal= float(len(not_serious))
setotal= float(len(serious))
#Did Police Officer Attend Scene Of Accident
plt.figure(figsize=(15,10))
ax = sns.countplot("did_police_officer_attend_scene_of_accident", hue="accident_seriousness",
palette="PuBu", data=not_serious)
plt.title("Did Police Officer Attend Scene Of Not Serious Accident",
fontsize=20, fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nAttendance", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.0, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber Attended", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('did_police_officer_attend_scene_of_accident_not_serious.png')
plt.show()
#Did Police Officer Attend Scene Of Accident
plt.figure(figsize=(15,10))
ax = sns.countplot("did_police_officer_attend_scene_of_accident", hue="accident_seriousness",
palette="PuBu", data=serious)
plt.title("Did Police Officer Attend Scene Of Serious Accident",
fontsize=20, fontweight="bold")
plt.style.use('dark_background')
plt.xlabel("\nAttendance", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.0, 1), loc='upper right', ncol=1)
plt.ylabel("\nNumber Attended", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('did_police_officer_attend_scene_of_accident_serious.png')
plt.show()
# First point of impact vs accident seriousness (not serious)
fpoa_order = ["Front", "Nearside", "Did not impact", "Back", "Offside"]
plt.figure(figsize=(20,10))
ax = sns.countplot("x1st_point_of_impact", hue="accident_seriousness", order=fpoa_order,
                   palette="PuBu", data=not_serious)
plt.title("First Point of Impact in Not Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nPoint of Impact", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nFirst Point of Impact Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('x1st_point_of_impact_not_serious.png')
plt.show()
# First point of impact vs accident seriousness (serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("x1st_point_of_impact", hue="accident_seriousness", order=fpoa_order,
                   palette="PuBu", data=serious)
plt.title("First Point of Impact in Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nPoint of Impact", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nFirst Point of Impact Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('x1st_point_of_impact_serious.png')
plt.show()
# Number of vehicles vs accident seriousness (not serious)
nov_order = ["1", "2", "3", "4+"]
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="number_of_vehicles", hue_order=nov_order,
                   palette="GnBu_d", data=not_serious)
plt.title("Number of Vehicles in Not Serious Accidents",
          fontsize=20, fontweight="bold")
plt.xlabel("\nNumber of Vehicles", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('number_of_vehicles_not_serious.png')
plt.show()
# Number of vehicles vs accident seriousness (serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="number_of_vehicles", hue_order=nov_order,
                   palette="GnBu_d", data=serious)
plt.title("Number of Vehicles in Serious Accidents",
          fontsize=20, fontweight="bold")
plt.xlabel("\nNumber of Vehicles", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('number_of_vehicles_serious.png')
plt.show()
# Speed limit vs accident seriousness (not serious)
splt_order = [15.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
plt.figure(figsize=(20,10))
ax = sns.countplot("speed_limit", hue="accident_seriousness", order=splt_order,
                   palette="PuBu", data=not_serious)
plt.title("Speed Limit vs Not Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nSpeed Limits", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nCount", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.4f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('speed_limit_not_serious.png')
plt.show()
# Speed limit vs accident seriousness (serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("speed_limit", hue="accident_seriousness",
                   palette="PuBu", data=serious)
plt.title("Speed Limit vs Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nSpeed Limits", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nCount", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('speed_limit_serious.png')
plt.show()
# Urban or rural area vs accident seriousness (not serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="urban_or_rural_area",
                   palette="PuBu", data=not_serious)
plt.title("Urban or Rural Area in Not Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nSeverity", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nUrban or Rural Area Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('urban_or_rural_area_not_serious.png')
plt.show()
# Urban or rural area vs accident seriousness (serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="urban_or_rural_area",
                   palette="PuBu", data=serious)
plt.title("Urban or Rural Area in Serious Accidents", fontsize=20, fontweight="bold")
plt.xlabel("\nSeverity", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.ylabel("\nUrban or Rural Area Count", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('urban_or_rural_area_serious.png')
plt.show()
# Skidding and overturning (not serious)
sao_order = ["None", "Skidded", "Skidded and overturned", "Overturned", "Jackknifed",
             "Jackknifed and overturned"]
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="skidding_and_overturning", hue_order=sao_order,
                   palette="magma", data=not_serious)
plt.title("Skidding and Overturning in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Skidding and Overturning", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('skidding_and_overturning_not_serious.png')
plt.show()
# Skidding and overturning (serious)
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="skidding_and_overturning", hue_order=sao_order,
                   palette="magma", data=serious)
plt.title("Skidding and Overturning in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Skidding and Overturning", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('skidding_and_overturning_serious.png')
plt.show()
# Vehicle leaving carriageway (not serious)
vlc_order = ["Did not leave carriageway", "Straight ahead at junction", "Nearside",
             "Offside", "Offside on to central reservation", "Nearside and rebounded",
             "Offside - crossed central reservation", "Offside and rebounded",
             "Offside on to centrl res + rebounded"]
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_leaving_carriageway", hue_order=vlc_order,
                   palette="plasma", data=not_serious)
plt.title("Vehicle Leaving Carriageway in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Vehicle Leaving Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents\n", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_leaving_carriageway_not_serious.png')
plt.show()
# Vehicle leaving carriageway (serious)
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_leaving_carriageway", hue_order=vlc_order,
                   palette="plasma", data=serious)
plt.title("Vehicle Leaving Carriageway in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Vehicle Leaving Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents\n", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.3f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_leaving_carriageway_serious.png')
plt.show()
# Sex of driver (not serious)
sod_order = ["Female", "Male", "Not known"]
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="sex_of_driver", hue_order=sod_order,
                   palette="magma", data=not_serious)
plt.title("Sex of Driver in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSex of Driver", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('sex_of_driver_not_serious.png')
plt.show()
# Sex of driver (serious)
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="sex_of_driver", hue_order=sod_order,
                   palette="magma", data=serious)
plt.title("Sex of Driver in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSex of Driver", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('sex_of_driver_serious.png')
plt.show()
#sex_of_driver
df['sex_of_driver'].value_counts()/df.shape[0]*100
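The manual division by `df.shape[0]` above can also be written with `value_counts(normalize=True)`, which returns fractions directly. A quick sketch on a stand-in frame (`df_demo` is hypothetical; the notebook's `df` is built in an earlier section):

```python
import pandas as pd

# Stand-in for the notebook's full `df`, just to show the idiom.
df_demo = pd.DataFrame({"sex_of_driver": ["Male", "Male", "Male", "Female"]})

# normalize=True yields fractions of the column total; multiply by 100 for percentages.
pct = df_demo["sex_of_driver"].value_counts(normalize=True) * 100
print(pct)
```

The result is identical to `value_counts()/df.shape[0]*100`, with one less moving part.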
# Vehicle type (not serious)
vt_order = ['Bus', 'Car', 'Goods Vehicle', 'Motorcycle', 'Other Vehicle']
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_type", hue_order=vt_order,
                   palette="tab20", data=not_serious)
plt.title("Vehicle Type in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accidents by Vehicle Type", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_type_not_serious.png')
plt.show()
# Vehicle type (serious)
plt.figure(figsize=(15,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_type", hue_order=vt_order,
                   palette="tab20", data=serious)
plt.title("Vehicle Type in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accidents by Vehicle Type", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_type_serious.png')
plt.show()
# Vehicle manoeuvres (not serious)
vm_order = ['Turning right', 'Going ahead other', 'Going ahead right-hand bend',
            'Slowing or stopping', 'Turning left', 'Waiting to go - held up',
            'Waiting to turn right', 'Overtaking static vehicle - offside',
            'Parked', 'Overtaking - nearside', 'U-turn', 'Changing lane to right',
            'Reversing', 'Waiting to turn left', 'Changing lane to left',
            'Going ahead left-hand bend', 'Overtaking moving vehicle - offside', 'Moving off']
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_manoeuvre", hue_order=vm_order,
                   palette="tab20", data=not_serious)
plt.title("Vehicle Manoeuvres in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Vehicle Manoeuvres", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_manoeuvre_not_serious.png')
plt.show()
# Vehicle manoeuvres (serious)
plt.figure(figsize=(20,10))
ax = sns.countplot("accident_seriousness", hue="vehicle_manoeuvre", hue_order=vm_order,
                   palette="tab20", data=serious)
plt.title("Vehicle Manoeuvres in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Vehicle Manoeuvres", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('vehicle_manoeuvre_serious.png')
plt.show()
# driver_home_area_type (not serious)
dhoa_order = ['Urban area', 'Rural', 'Small town']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="driver_home_area_type", hue_order=dhoa_order,
                   palette="rainbow", data=not_serious)
plt.title("Driver Home Area Type in Not Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSeriousness", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_home_area_type_not_serious.png')
plt.show()
# driver_home_area_type (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="driver_home_area_type", hue_order=dhoa_order,
                   palette="rainbow", data=serious)
plt.title("Driver Home Area Type in Serious Accidents", fontsize=25, fontweight="bold")
plt.xlabel("\nSeriousness", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_home_area_type_serious.png')
plt.show()
# age_band_of_driver (not serious)
abod_order = ['Under 16', '16-25', '26-45', '46-65', 'Over 65']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="age_band_of_driver", hue_order=abod_order,
                   palette="magma", data=not_serious)
plt.title("Not Serious Accident by Age Band of Driver", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Age Band of Driver", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('age_band_of_driver_not_serious.png')
plt.show()
# age_band_of_driver (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="age_band_of_driver", hue_order=abod_order,
                   palette="magma", data=serious)
plt.title("Serious Accident by Age Band of Driver", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accident by Age Band of Driver", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('age_band_of_driver_serious.png')
plt.show()
# junction_control (not serious)
jc_order = ['Give way or uncontrolled', 'Auto traffic signal', 'Authorised person',
            'Stop sign', 'Not at junction or within 20 metres']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="junction_control", hue_order=jc_order,
                   palette="magma", data=not_serious)
plt.title("Not Serious Accident by Junction Control", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Control", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_control_not_serious.png')
plt.show()
# junction_control (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="junction_control", hue_order=jc_order,
                   palette="magma", data=serious)
plt.title("Serious Accident by Junction Control", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Control", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_control_serious.png')
plt.show()
# hit_object_off_carriageway (not serious)
hooffc_order = ['None', 'Lamp post', 'Road sign or traffic signal', 'Other permanent object',
                'Entered ditch', 'Tree', 'Near/Offside crash barrier', 'Central crash barrier',
                'Bus stop or bus shelter', 'Telegraph or electricity pole', 'Submerged in water',
                'Wall or fence']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="hit_object_off_carriageway", hue_order=hooffc_order,
                   palette="plasma", data=not_serious)
plt.title("Not Serious Accident by Hit Object Off Carriageway", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Hit Object Off Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_off_carriageway_not_serious.png')
plt.show()
# hit_object_off_carriageway (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="hit_object_off_carriageway", hue_order=hooffc_order,
                   palette="plasma", data=serious)
plt.title("Serious Accident by Hit Object Off Carriageway", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accident by Hit Object Off Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_off_carriageway_serious.png')
plt.show()
# hit_object_in_carriageway (not serious)
hoinc_order = ['None', 'Kerb', 'Other object', 'Bollard or refuge', 'Parked vehicle',
               'Road works', 'Open door of vehicle', 'Central island of roundabout',
               'Previous accident', 'Bridge (side)', 'Any animal (except ridden horse)',
               'Bridge (roof)']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="hit_object_in_carriageway", hue_order=hoinc_order,
                   palette="plasma", data=not_serious)
plt.title("Not Serious Accident by Hit Object in Carriageway", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Hit Object in Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_in_carriageway_not_serious.png')
plt.show()
# hit_object_in_carriageway (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="hit_object_in_carriageway", hue_order=hoinc_order,
                   palette="plasma", data=serious)
plt.title("Serious Accident by Hit Object in Carriageway", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accident by Hit Object in Carriageway", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('hit_object_in_carriageway_serious.png')
plt.show()
# driver_imd_decile (not serious), ordered from least to most deprived
imd_order = ["Least deprived 10%", "Less deprived 10-20%", "Less deprived 20-30%",
             "Less deprived 30-40%", "Less deprived 40-50%", "More deprived 40-50%",
             "More deprived 30-40%", "More deprived 20-30%", "More deprived 10-20%",
             "Most deprived 10%"]
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="driver_imd_decile", hue_order=imd_order,
                   palette="plasma", data=not_serious)
plt.title("Not Serious Accident by Driver Area Deprivation Score", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Driver Area Deprivation Score", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_imd_decile_not_serious.png')
plt.show()
# driver_imd_decile (serious)
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="driver_imd_decile", hue_order=imd_order,
                   palette="plasma", data=serious)
plt.title("Serious Accident by Driver Area Deprivation Score", fontsize=25, fontweight="bold")
plt.xlabel("\nSerious Accident by Driver Area Deprivation Score", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('driver_imd_decile_serious.png')
plt.show()
# junction_detail (not serious)
jud_order = ['T or staggered junction', 'Mini-roundabout', 'Crossroads',
             'Private drive or entrance', 'More than 4 arms (not roundabout)',
             'Roundabout', 'Slip road', 'Other junction', 'Not at junction or within 20 metres']
plt.figure(figsize=(20,15))
ax = sns.countplot("accident_seriousness", hue="junction_detail", hue_order=jud_order,
                   palette="plasma", data=not_serious)
plt.title("Not Serious Accident by Junction Detail", fontsize=25, fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Detail", fontsize=15, fontweight="bold")
plt.legend(fontsize=15, bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1).set_title('')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center", fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_detail_not_serious.png')
plt.show()
#Serious Accident junction_detail
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_detail", hue_order=jud_order,
palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Detail", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_detail_serious.png')
plt.show()
#junction_location
jul_order=['Mid Junction - on roundabout or on main road', 'Entering main road',
'Approaching junction or waiting/parked at junction approach',
'Cleared junction or waiting/parked at junction exit', 'Leaving main road',
'Leaving roundabout', 'Entering roundabout', 'Entering from slip road',
'Not at or within 20 metres of junction']
#Not Serious Accident junction_location
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_location", hue_order=jul_order,
palette="plasma", data=not_serious)
plt.style.use('dark_background')
plt.title("Not Serious Accident by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Junction Location", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_location_not_serious.png')
plt.show()
#Serious Accident junction_location
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="junction_location", hue_order=jul_order,
palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Junction Location", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('junction_location_serious.png')
plt.show()
#propulsion_code
pd_order=['Petrol', 'Heavy oil', 'Hybrid electric', 'Bio-fuel', 'LPG Petrol', 'Diesel',
'Fuel cells', 'New fuel technology', 'Electric diesel']
pd_order2=['Petrol', 'Heavy oil', 'Hybrid electric', 'Bio-fuel', 'LPG Petrol', 'Electric diesel']
#Not Serious Accident propulsion_code
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="propulsion_code", hue_order=pd_order,
palette="plasma", data=not_serious)
plt.style.use('dark_background')
plt.title("Not Serious Accident by Propulsion Code",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Propulsion Code", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('propulsion_code_not_serious.png')
plt.show()
#Serious Accident propulsion_code
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="propulsion_code", hue_order=pd_order2,
palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Propulsion Code",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Propulsion Code", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('propulsion_code_serious.png')
plt.show()
#year
year_order=[2010, 2011, 2012, 2013, 2014, 2015, 2016]
#Not Serious Accident year
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="year", hue_order=year_order,
palette="plasma", data=not_serious)
plt.style.use('dark_background')
plt.title("Not Serious Accident by Year",fontsize=25,fontweight="bold")
plt.xlabel("\nNot Serious Accident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/nstotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('year_not_serious.png')
plt.show()
#Serious Accident year
plt.figure(figsize=(20,15))
ax=sns.countplot("accident_seriousness", hue="year", hue_order=year_order,
palette="plasma", data=serious)
plt.style.use('dark_background')
plt.title("Serious Accident by Year",fontsize=25,fontweight="bold")
plt.xlabel("\nSerious Accident by Year", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x()+p.get_width()/2.,
            height + 3,
            '{:1.2f}%'.format(height/setotal*100),
            ha="center",fontsize=15)
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=True)
plt.savefig('year_serious.png')
plt.show()
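Every plot above repeats the same six-line bar-annotation loop. As a sketch, that logic could be factored into a small helper (the names `pct_label` and `annotate_bars` are mine, not from the notebook):

```python
def pct_label(height, total):
    """Format a bar height as a share of a group total, e.g. '12.50%'."""
    return '{:1.2f}%'.format(height / total * 100)

def annotate_bars(ax, total):
    """Write each patch's share of `total` above its bar, as done in the plots above."""
    for p in ax.patches:
        height = p.get_height()
        ax.text(p.get_x() + p.get_width() / 2., height + 3,
                pct_label(height, total), ha="center", fontsize=15)
```

With this, each cell's annotation loop would become a single call such as `annotate_bars(ax, nstotal)`.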
Based on the patterns in the previous visualizations, selected variables were compared against each other to look for further correlations.
#Not Serious Accident
plt.figure(figsize=(20,15))
ax=sns.countplot("junction_control", hue="junction_detail",
palette="plasma", data=df)
plt.style.use('dark_background')
plt.title("Junction Control by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nJunction Control", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('junction_control_by_junction_detail.png')
plt.show()
plt.figure(figsize=(20,15))
ax=sns.countplot("junction_control", hue="junction_location",
palette="plasma", data=df)
plt.style.use('dark_background')
plt.title("Junction Control by Junction Location in Accidents",fontsize=25,fontweight="bold")
plt.xlabel("\nJunction Control", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('junction_control_by_junction_location.png')
plt.show()
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_detail",
palette="plasma", data=df)
plt.style.use('dark_background')
plt.title("First point of Impact by Junction Detail",fontsize=25,fontweight="bold")
plt.xlabel("\nFirst Point of Impact", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_detail.png')
plt.show()
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_location",
palette="plasma", data=df)
plt.style.use('dark_background')
plt.title("First point of Impact by Junction Location",fontsize=25,fontweight="bold")
plt.xlabel("\nFirst Point of Impact", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_location.png')
plt.show()
plt.figure(figsize=(20,15))
ax=sns.countplot("x1st_point_of_impact", hue="junction_control",
palette="plasma", data=df)
plt.style.use('dark_background')
plt.title("First point of Impact by Junction Control",fontsize=25,fontweight="bold")
plt.xlabel("\nFirst Point of Impact", fontsize=15, fontweight="bold")
plt.legend().set_title('')
plt.legend(fontsize='22', loc = 'upper right')
plt.ylabel("\nNumber of Accidents", fontsize=15, fontweight="bold")
plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
plt.legend(fontsize='15', bbox_to_anchor=(1.04, 1), loc='upper right', ncol=1)
# plt.tick_params(axis='x', which='both', bottom=False, top=False, labelbottom=False)
sns.despine(top=True, right=True, left=True, bottom=False)
plt.savefig('x1st_point_of_impact_by_junction_control.png')
plt.show()
No matter the breakdown above, most accidents involved areas that were uncontrolled. One of the most common junction details was the T or staggered junction.
Other areas of concern include mid-junction locations on roundabouts or main roads, and junction approaches where cars were parked or waiting.
Based on the data above, more controlled areas would be beneficial. Signs alerting drivers to upcoming junctions, traffic lights, or stop signs could help in areas where they are feasible.
For example, this is a staggered junction, the junction detail most often involved in accidents. One can see how a layout like this can lead to numerous accidents, especially when proper signage is not available. Traffic lights, stop signs, or warnings that drivers are approaching such junctions could help reduce accidents.
Below you will find a web scrape of the website Learner Driving Centres, which contains information on road signs in the UK. The signs were pulled to show examples of signage available to be placed.
#import scraping libraries and request the website
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.learnerdriving.com/learn-to-drive/highway-code/road-signs')
#parse HTML
soup = BeautifulSoup(r.text, 'html.parser')
#filter results
results = soup.find_all('div', attrs={'class':'fifth'})
#done to find specific results area
first_result=results[0]
first_result
first_result.find('img')['src']
#get images of signs and sign descriptions
signage = []
for result in results:
    sign=result.find('img')['src']
    sign_desc=result.contents[1]
    signage.append((sign, sign_desc))
#put pulled UK Traffic Signs into dataframe
uktrafficsigns = pd.DataFrame(signage, columns=['Sign', 'Sign Description'])
uktrafficsigns.head()
'''
The scraped "Sign" value is only the path portion of the image link;
the site root must be prepended to form the full image URL.
'''
uktrafficsigns['Sign'] = 'https://www.learnerdriving.com/'+uktrafficsigns['Sign']
uktrafficsigns.head()
'''
In some code below, one of the fields (index 42) was blank but was not reading as null.
To fix that, its "Sign Description" was set here.
'''
uktrafficsigns.at[42,'Sign Description']="T-junction with priority over vehicles from the right"
#I wanted to save this as a csv for later, and to stop unnecessary web scraping
uktrafficsigns.to_csv('uktrafficsigns.csv', header=False, index=False)
#I wanted the html to show up as images instead of links
from IPython.display import HTML

def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', None)  # None (formerly -1) means no truncation
ukts=HTML(uktrafficsigns.to_html(escape=False, formatters=dict(Sign=path_to_image_html)))
HTML(uktrafficsigns.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
'''
Here I am creating a df that will allow me to pull all junction signs.
"nction" was matched instead of "junction" so that capitalized variants are caught as well.
'''
junction = uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("nction", regex=False)]
#Making it its own HTML object (same as above)
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', None)
HTML(junction.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
#Repeated the above steps for giveways
give=uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("ive ", regex=False)]
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', None)
HTML(give.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
#roundabouts
roundabout=uktrafficsigns[uktrafficsigns['Sign Description'].str.contains("ounda", regex=False)]
def path_to_image_html(path):
    return '<img src="'+ path + '" width="60" >'

pd.set_option('display.max_colwidth', None)
HTML(roundabout.to_html(escape=False ,formatters=dict(Sign=path_to_image_html)))
Below, Tableau was used to map what could be deemed problem areas for the UK: accidents in areas with high deprivation (driver_imd_decile at "more deprived 40-50%") and no signage at T or staggered junctions.
%%HTML
<div class='tableauPlaceholder' id='viz1572177057382' style='position: relative'><noscript><a href='https://github.com/GenTaylor/Traffic-Accident-Analysis'><img alt=' ' src='https://public.tableau.com/static/images/Ac/AccidentForecasting/SeriousAccidentsinAreaswithHighDeprivationandNoSignage/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='AccidentForecasting/SeriousAccidentsinAreaswithHighDeprivationandNoSignage' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Ac/AccidentForecasting/SeriousAccidentsinAreaswithHighDeprivationandNoSignage/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1572177057382'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
#made a separate dataframe with its own index so it wouldn't affect the data vis above
df1=df.copy()
#set index to accident_index
df1.set_index('accident_index', inplace=True)
df1.head()
df1 = df1.drop(['accident_severity'],axis=1)
df1.head()
print(df1.columns)
#separate dtypes
notif=df1.select_dtypes(exclude=['int','float','int64'])
intfldtypes = df1.select_dtypes(include=['int','float','int64'])
print('Objects',notif.columns)
print("\nNonObjects",intfldtypes.columns)
#checking to make sure all are accounted for
print(df1.shape)
print(notif.shape)
print(intfldtypes.shape)
LabelEncoder was used instead of OneHotEncoder due to the memory errors OneHotEncoder caused on this data. The algorithms used will be tree-based and boosting classifiers, not linear models, so the arbitrary ordering of integer codes is less of a concern.
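For reference, label encoding simply maps each distinct category to an integer code. A minimal pure-Python sketch of what `LabelEncoder` does per column (assuming sorted-order codes, which is how scikit-learn assigns them):

```python
def label_encode(values):
    """Map each distinct category to an integer code, assigned in sorted
    order (mirroring how sklearn's LabelEncoder sorts its classes)."""
    mapping = {c: i for i, c in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values]

label_encode(['Wet', 'Dry', 'Wet', 'Snow'])  # -> [2, 0, 2, 1]
```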
#label encode objects
obj_le= notif.apply(LabelEncoder().fit_transform)
#re-add with non-objects
df_ml= pd.concat([obj_le,intfldtypes], axis=1, sort=False)
#check shape
print(df_ml.shape)
#Set up of X and Y
X= df_ml.drop(['accident_seriousness'],axis=1)
y= df_ml['accident_seriousness']
df_ml.accident_seriousness.value_counts()
The data in this dataset is extremely imbalanced for the target we are trying to predict, so we will resample it by undersampling, reducing the number of majority-class (Not Serious) samples.
The machine learning classifiers we are going to use are Bagging, AdaBoost, and Random Forest.
*Gradient Boosting was commented out because of the time it took to run (18 hours) without producing results relevant enough to keep in consideration.
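The undersampling idea can be sketched with the standard library alone (the function name `undersample` and the stand-in data are mine; note that the actual resample call below uses replace=True, while this sketch samples without replacement):

```python
import random

def undersample(majority, minority, seed=27):
    """Draw len(minority) rows from the majority class (without replacement)
    so both classes contribute equally to training."""
    rng = random.Random(seed)                 # seeded for reproducible results
    return rng.sample(majority, len(minority)) + list(minority)

# stand-ins for the Not Serious (majority) and Serious (minority) rows
maj = list(range(100))
mino = list(range(200, 210))
balanced = undersample(maj, mino)             # 10 majority + 10 minority rows
```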
# setting up testing and training sets
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

res_X_train, res_X_test, res_y_train, res_y_test = train_test_split(X, y,
                                                                    test_size=0.25,
                                                                    random_state=27)
# concatenate our training data back together
res_X = pd.concat([res_X_train, res_y_train], axis=1)
# separate minority and majority classes
not_severe = res_X[res_X.accident_seriousness==0]
severe = res_X[res_X.accident_seriousness==1]
# downsample the majority class
not_severe_decreased = resample(not_severe,
                                replace=True,           # sample with replacement
                                n_samples=len(severe),  # match the minority class count
                                random_state=27)        # reproducible results
# combine the minority class with the downsampled majority
newdf = pd.concat([severe, not_severe_decreased])
newdf.accident_seriousness.value_counts()
res_X_train = newdf.drop('accident_seriousness', axis=1)
res_y_train = newdf.accident_seriousness
Before we get into predictions, we are going to do some unsupervised learning to see how the features relate to one another. We will do this on the resampled data as well, in order to avoid bias. We will use two clusters, which in theory represent the two accident_seriousness classes, Not Serious and Serious.
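Under the hood, k-modes measures dissimilarity by counting how many attributes of a row differ from a cluster's mode, then assigns the row to the nearest mode. A minimal sketch of that step (not the kmodes library's implementation; the example modes and values are made up):

```python
def matching_dissim(row, mode):
    """k-modes dissimilarity: the number of attributes that differ from the mode."""
    return sum(a != b for a, b in zip(row, mode))

def assign_cluster(row, modes):
    """Assign a row to the cluster whose mode it mismatches the least."""
    return min(range(len(modes)), key=lambda k: matching_dissim(row, modes[k]))

# made-up modes for two clusters over (road_surface, area_type)
modes = [('Dry', 'Urban'), ('Wet', 'Rural')]
assign_cluster(('Wet', 'Rural'), modes)  # -> 1
```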
# "clustering" using the k-modes algorithm, which is designed to handle categorical data
from kmodes.kmodes import KModes

km_huang = KModes(n_clusters=2, init="Huang", n_init=1)
fitClusters_huang = km_huang.fit_predict(newdf)
fitClusters_huang
newdf1 = newdf.copy().reset_index()
clustersDf = pd.DataFrame(fitClusters_huang)
clustersDf.columns = ['cluster_predicted']
combinedDf = pd.concat([newdf1, clustersDf], axis = 1).reset_index()
combinedDf = combinedDf.drop(['index'], axis = 1)
combinedDf.head()
#plotting a few of these features just to see how they relate to the clustering for seriousness
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['did_police_officer_attend_scene_of_accident'],
order=combinedDf['did_police_officer_attend_scene_of_accident'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['x1st_point_of_impact'],
order=combinedDf['x1st_point_of_impact'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['number_of_vehicles'],
order=combinedDf['number_of_vehicles'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot1.png')
plt.show()
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['speed_limit'],
order=combinedDf['speed_limit'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['urban_or_rural_area'],
order=combinedDf['urban_or_rural_area'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['skidding_and_overturning'],
order=combinedDf['skidding_and_overturning'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot2.png')
plt.show()
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['vehicle_leaving_carriageway'],
order=combinedDf['vehicle_leaving_carriageway'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['sex_of_driver'],
order=combinedDf['sex_of_driver'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['vehicle_type'],
order=combinedDf['vehicle_type'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot3.png')
plt.show()
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['junction_control'],
order=combinedDf['junction_control'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['number_of_casualties'],
order=combinedDf['number_of_casualties'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['age_band_of_driver'],
order=combinedDf['age_band_of_driver'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot4.png')
plt.show()
f, axs = plt.subplots(1,3,figsize = (15,8))
sns.countplot(x=combinedDf['junction_detail'],
order=combinedDf['junction_detail'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[0])
sns.countplot(x=combinedDf['junction_location'],
order=combinedDf['junction_location'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[1])
sns.countplot(x=combinedDf['driver_imd_decile'],
order=combinedDf['driver_imd_decile'].value_counts().index,
hue=combinedDf['cluster_predicted'], palette='PuBu', ax=axs[2])
plt.tight_layout()
plt.savefig('clusterplot5.png')
plt.show()
Looking at these graphs, we can see how each category of each column pairs off with the clustering on accident_seriousness.
#start timing
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, confusion_matrix, roc_auc_score)
from sklearn.model_selection import cross_val_score

start_bagc_res = time.time()
#Resampled Bagging Classifier
bagc_res = BaggingClassifier(max_features=X.shape[1], n_estimators=500, random_state=42)
bagc_res.fit(res_X_train, res_y_train)
pred_bagc_res = bagc_res.predict(res_X_test)
#Check Scores
print("Resampled Bagging Classifier Accuracy Score: {:0.2f}%".format(accuracy_score(res_y_test,
pred_bagc_res )*100))
print("Resampled Bagging Classifier F1 Score: {:0.2f}%".format(f1_score(res_y_test,
pred_bagc_res,average="macro")*100))
print("Resampled Bagging Classifier Precision Score: {:0.2f}%".format(precision_score(res_y_test,
pred_bagc_res,
average="macro")*100))
print("Resampled Bagging Classifier Recall Score: {:0.2f}%".format(recall_score(res_y_test,
pred_bagc_res,
average="macro")*100))
print("Resampled Bagging Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(bagc_res, res_X_train, res_y_train, cv=5)*100)))
print('\n')
# Creates a confusion matrix
bagc_res_cm = confusion_matrix(res_y_test,pred_bagc_res)
# Transform to df for easier plotting
bagc_res_cm_df = pd.DataFrame(bagc_res_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(bagc_res_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Bagging Classifier Accuracy: {0:.2f}%'
.format(accuracy_score(res_y_test,pred_bagc_res )*100),fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
#end time
end_bagc_res = time.time()
print("Resampled Bagging Classifier Time:", end_bagc_res - start_bagc_res)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_bagc_res).ravel()
accuracy = accuracy_score(res_y_test,pred_bagc_res )*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled Bagging Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Bagging Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Bagging Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled Bagging Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(res_y_test,pred_bagc_res)*100))
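The derived scores above come straight from the 2x2 confusion-matrix counts. A small helper (the name `rates_from_confusion` is mine) makes the formulas explicit:

```python
def rates_from_confusion(tn, fp, fn, tp):
    """Specificity, false positive rate, and accuracy (all in %) from 2x2 confusion counts."""
    specificity = tn / (tn + fp) * 100               # true negatives among actual negatives
    fpr = fp / (tn + fp) * 100                       # complement of specificity
    accuracy = (tp + tn) / (tn + fp + fn + tp) * 100
    return specificity, fpr, accuracy
```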
AdaBoost is a boosting algorithm widely used on imbalanced data. It uses single-level decision trees (stumps) as weak classifiers. In each training iteration, the weights of samples misclassified in the previous iteration are increased and the weights of correctly classified samples are reduced, making the misclassified samples more significant in the next iteration. Although AdaBoost can be applied directly to imbalanced data, it focuses on misclassified samples in general rather than on the minority class specifically. It may also generate many redundant or useless weak classifiers, increasing processing overhead and reducing performance.
With that said, we will use AdaBoost on the resampled set, and in the class_weight sets below we will run it on the unaltered data to see how it handles the imbalance on its own versus with resampling.
See: Improved PSO_AdaBoost Ensemble Algorithm for Imbalanced Data
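The reweighting step described above can be sketched in a few lines (a simplified illustration, not sklearn's implementation; `alpha` is the stump weight derived from its weighted error):

```python
import math

def adaboost_reweight(weights, correct, error):
    """One AdaBoost round: up-weight misclassified samples, down-weight correct
    ones, then renormalize. correct[i] is True if sample i was classified right."""
    alpha = 0.5 * math.log((1 - error) / error)   # stump weight from its weighted error
    new_w = [w * math.exp(-alpha if ok else alpha)
             for w, ok in zip(weights, correct)]
    total = sum(new_w)
    return [w / total for w in new_w]
```

A classic consequence: after one round, the misclassified samples collectively hold half of the total weight, which is what forces the next stump to attend to them.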
#start
start_res_adbc = time.time()
#Resampled AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier

res_adbc = AdaBoostClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
res_adbc.fit(res_X_train, res_y_train)
pred_res_adbc = res_adbc.predict(res_X_test)
#Check scores
print("Resampled AdaBoost Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_adbc, res_X_train, res_y_train, cv=3)*100)))
print('\n')
# Creates a confusion matrix
res_adbc_cm = confusion_matrix(res_y_test,pred_res_adbc)
# Transform to dataframe for easier plotting
res_adbc_cm_df = pd.DataFrame(res_adbc_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_adbc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled AdaBoost Classifier Accuracy: {0:.2f}%'
.format(accuracy_score(res_y_test,pred_res_adbc )*100),fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
#end time
end_res_adbc = time.time()
print("Resampled AdaBoost Classifier Time:", end_res_adbc - start_res_adbc)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_adbc).ravel()
accuracy = accuracy_score(res_y_test,pred_res_adbc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled AdaBoost Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled AdaBoost Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled AdaBoost Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled AdaBoost Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(res_y_test,pred_res_adbc )*100))
print("Resampled AdaBoost Classifier F1 Score: {:0.2f}%"
.format(f1_score(res_y_test, pred_res_adbc,average="macro")*100))
print("Resampled AdaBoost Classifier Precision Score: {:0.2f}%"
.format(precision_score(res_y_test, pred_res_adbc, average="macro")*100))
print("Resampled AdaBoost Classifier Recall Score: {:0.2f}%"
.format(recall_score(res_y_test, pred_res_adbc, average="macro")*100))
print("Resampled AdaBoost Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(res_y_test,pred_res_adbc)*100))
#start
start_res_rfc = time.time()
#random forest
from sklearn.ensemble import RandomForestClassifier

res_rfc = RandomForestClassifier(criterion='entropy', max_depth=40,
                                 max_features=X.shape[1], min_samples_split=8,
                                 n_estimators=500, random_state=42)
res_rfc.fit(res_X_train, res_y_train)
pred_res_rfc = res_rfc.predict(res_X_test)
#cv
print("Resampled Random Forest Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_rfc, res_X_train, res_y_train, cv=3)*100)))
print('\n')
# Creates a confusion matrix
res_rfc_cm = confusion_matrix(res_y_test,pred_res_rfc)
# Transform to df for easier plotting
res_rfc_cm_df = pd.DataFrame(res_rfc_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_rfc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Random Forest Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
pred_res_rfc)*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
end_res_rfc = time.time()
print("\nResampled Random Forest Time: ", end_res_rfc - start_res_rfc)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_rfc).ravel()
accuracy = accuracy_score(res_y_test,pred_res_rfc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled Random Forest Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Random Forest Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Random Forest Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled Random Forest Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(res_y_test,pred_res_rfc )*100))
print("Resampled Random Forest Classifier F1 Score: {:0.2f}%"
.format(f1_score(res_y_test, pred_res_rfc,average="macro")*100))
print("Resampled Random Forest Classifier Precision Score: {:0.2f}%"
.format(precision_score(res_y_test, pred_res_rfc, average="macro")*100))
print("Resampled Random Forest Classifier Recall Score: {:0.2f}%"
.format(recall_score(res_y_test, pred_res_rfc, average="macro")*100))
print("Resampled Random Forest Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(res_y_test, pred_res_rfc)*100))
# Resampled Gradient Boosting Classifier was taken out of the running due to its run time of almost a day
# start_res_gbc = time.time()
# res_gbc = ensemble.GradientBoostingClassifier(learning_rate=0.05, max_depth=40,
# min_samples_leaf=1, n_estimators=500,
# random_state = 42)
# res_gbc.fit(res_X_train, res_y_train)
# pred_res_gbc = res_gbc.predict(res_X_test)
# #Check accuracy
# print("Resampled Gradient Boosting Classifier Accuracy Score: {:0.2f}%"
# .format(accuracy_score(res_y_test,pred_res_gbc )*100))
# print("Resampled Gradient Boosting Classifier F1 Score: {:0.2f}%"
# .format(f1_score(res_y_test, pred_res_gbc,average="macro")*100))
# print("Resampled Gradient Boosting Classifier Precision Score: {:0.2f}%"
# .format(precision_score(res_y_test, pred_res_gbc, average="macro")*100))
# print("Resampled Gradient Boosting Classifier Recall Score: {:0.2f}%"
# .format(recall_score(res_y_test, pred_res_gbc, average="macro")*100))
# print("Resampled Gradient Boosting Classifier Cross Validation Score: {:0.2f}%"
# .format(np.mean(cross_val_score(res_gbc, res_X_train, res_y_train, cv=5)*100)))
# print('\n')
# # Creates a confusion matrix
# res_gbc_cm = confusion_matrix(res_y_test,pred_res_gbc)
# # Transform to df for easier plotting
# res_gbc_cm_df = pd.DataFrame(res_gbc_cm,
# index = ['Not Serious','Serious'],
# columns = ['Not Serious','Serious'])
# plt.figure(figsize=(15,5))
# sns.heatmap(res_gbc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
# plt.title('Resampled Gradient Boosting Classifier Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
# pred_res_gbc)*100),
# fontsize=15)
# plt.ylabel('Actual\n')
# plt.xlabel('Predicted\n')
# plt.show()
# end_res_gbc = time.time()
# print("\nResampled Gradient Boosting Time: ", end_res_gbc - start_res_gbc)
# Results for Gradient Boosting pasted from the machine learning notebook; do NOT re-run.
# Resampled Gradient Boosting Classifier Accuracy Score: 58.26%
# Resampled Gradient Boosting Classifier F1 Score: 48.58%
# Resampled Gradient Boosting Classifier Precision Score: 54.15%
# Resampled Gradient Boosting Classifier Recall Score: 59.65%
# Resampled Gradient Boosting Classifier Cross Validation Score: 61.43%
# Resampled Gradient Boosting Time: 67961.71300411224
# Confusion Matrix:
# [[71301,52009],
# [6540,10434]]
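As a quick sanity check, the headline Accuracy and Recall scores above can be re-derived by hand from the pasted confusion matrix:

```python
# Confusion matrix pasted from the archived Gradient Boosting run
# (rows = actual [Not Serious, Serious]; cols = predicted [Not Serious, Serious]).
tn, fp = 71301, 52009
fn, tp = 6540, 10434

total = tn + fp + fn + tp
accuracy = (tn + tp) / total * 100
# Macro recall = mean of the per-class recalls.
recall_not_serious = tn / (tn + fp)
recall_serious = tp / (tp + fn)
recall_macro = (recall_not_serious + recall_serious) / 2 * 100

print("Accuracy: {0:.2f}%".format(accuracy))          # 58.26%
print("Macro Recall: {0:.2f}%".format(recall_macro))  # 59.65%
```

Both values agree with the scores pasted above, which confirms the matrix and the metrics came from the same run.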
#Light GBM
start_res_lgbm = time.time()
res_lgbm = lgb.LGBMClassifier(learning_rate =0.03, max_depth=40, min_data_in_leaf=10,
n_estimators=500, num_leaves=50, random_state = 42)
res_lgbm.fit(res_X_train, res_y_train)
pred_res_lgbm = res_lgbm.predict(res_X_test)
#check cv
print("Resampled LightGBM Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_lgbm, res_X_train, res_y_train, cv=5)*100)))
print('\n')
res_lgbm_cm = confusion_matrix(res_y_test, pred_res_lgbm)
# Transform to df for easier plotting
res_lgbm_cm_df = pd.DataFrame(res_lgbm_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_lgbm_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled LightGBM Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
pred_res_lgbm)*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
end_res_lgbm = time.time()
print("\nResampled LightGBM Time: ", end_res_lgbm - start_res_lgbm)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_lgbm).ravel()
accuracy = accuracy_score(res_y_test,pred_res_lgbm)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled LightGBM Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled LightGBM Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled LightGBM Classifier Error Rate Score: {0:.2f}%".format(ers))
#check accuracy
print("Resampled LightGBM Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(res_y_test,pred_res_lgbm )*100))
print("Resampled LightGBM Classifier F1 Score: {:0.2f}%"
.format(f1_score(res_y_test, pred_res_lgbm,average="macro")*100))
print("Resampled LightGBM Classifier Precision Score: {:0.2f}%"
.format(precision_score(res_y_test, pred_res_lgbm, average="macro")*100))
print("Resampled LightGBM Classifier Recall Score: {:0.2f}%"
.format(recall_score(res_y_test, pred_res_lgbm, average="macro")*100))
print("Resampled LightGBM Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(res_y_test, pred_res_lgbm)*100))
#XGBoost
start_res_xgb = time.time()
res_xgb = XGBClassifier(learning_rate=0.05, n_estimators=500, subsample= 1,random_state = 42,
gamma = 1, max_depth=40)
res_xgb.fit(res_X_train, res_y_train)
pred_res_xgb = res_xgb.predict(res_X_test)
#check accuracy
print("Resampled XGBoost Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_xgb, res_X_train, res_y_train, cv=3)*100)))
print('\n')
# Transform to df for easier plotting of confusion matrix
res_xgb_cm = confusion_matrix(res_y_test, pred_res_xgb)
res_xgb_cm_df = pd.DataFrame(res_xgb_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_xgb_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled XGBoost Accuracy: {0:.2f}%'.format(accuracy_score(res_y_test,
pred_res_xgb)*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
end_res_xgb = time.time()
print("Resampled XGBoost Time:", end_res_xgb - start_res_xgb)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(res_y_test,pred_res_xgb).ravel()
accuracy = accuracy_score(res_y_test,pred_res_xgb)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled XGBoost Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled XGBoost Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled XGBoost Classifier Error Rate Score: {0:.2f}%".format(ers))
print("Resampled XGBoost Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(res_y_test,pred_res_xgb)*100))
print("Resampled XGBoost Classifier F1 Score: {:0.2f}%"
.format(f1_score(res_y_test, pred_res_xgb,average="macro")*100))
print("Resampled XGBoost Classifier Precision Score: {:0.2f}%"
.format(precision_score(res_y_test, pred_res_xgb, average="macro")*100))
print("Resampled XGBoost Classifier Recall Score: {:0.2f}%"
.format(recall_score(res_y_test, pred_res_xgb, average="macro")*100))
print("Resampled XGBoost Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(res_y_test, pred_res_xgb)*100))
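The specificity / false-positive-rate / error-rate block is repeated verbatim for every model above; it could be factored into a small helper. This is a sketch (`rate_report` is a hypothetical name, not part of the notebook) that takes the four cells in the order returned by `confusion_matrix(...).ravel()`:

```python
def rate_report(name, tn, fp, fn, tp):
    """Derive the rate-style scores from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total * 100
    specificity = tn / (tn + fp) * 100   # true negative rate
    fpr = fp / (tn + fp) * 100           # false positive rate
    error_rate = 100 - accuracy
    print("{0} Specificity Score: {1:.2f}%".format(name, specificity))
    print("{0} False Positive Rate Score: {1:.2f}%".format(name, fpr))
    print("{0} Error Rate Score: {1:.2f}%".format(name, error_rate))
    return specificity, fpr, error_rate

# Usage with any of the models above, e.g.:
# rate_report("Resampled XGBoost Classifier",
#             *confusion_matrix(res_y_test, pred_res_xgb).ravel())
```

One helper also removes the risk of pairing the wrong `y_test` variable with a prediction when the block is copy-pasted between models.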
For the following "Balanced" algorithms from imblearn, we will use the standard training and test sets (X_train, X_test, y_train, y_test) and let each algorithm handle the resampling internally.
For the sampling_strategy parameter, we will use "majority":
'majority': resample only the majority class
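To illustrate what the 'majority' strategy does, here is a minimal plain-Python sketch (not the imblearn internals): the majority class is randomly downsampled to the minority-class count, and the minority class is left untouched.

```python
import random
from collections import Counter

def undersample_majority(X, y, seed=42):
    """Randomly drop majority-class rows until both classes have the minority count."""
    counts = Counter(y)
    (maj_label, _), (_, n_min) = counts.most_common(2)
    rng = random.Random(seed)
    maj_idx = [i for i, label in enumerate(y) if label == maj_label]
    keep = set(rng.sample(maj_idx, n_min))  # majority rows that survive
    idx = [i for i, label in enumerate(y)
           if label != maj_label or i in keep]
    return [X[i] for i in idx], [y[i] for i in idx]

# Toy imbalanced data: 8 "Not Serious" (0) vs 2 "Serious" (1)
X_toy = list(range(10))
y_toy = [0] * 8 + [1] * 2
X_bal, y_bal = undersample_majority(X_toy, y_toy)
print(Counter(y_bal))  # both classes now have 2 rows
```

The imblearn classifiers below apply this idea per estimator: each member of the ensemble is fit on its own balanced resample, so no information from the majority class is discarded globally.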
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=27)
#start
start_res_bbag = time.time()
# Balanced Bagging Classifier
res_bbag = BalancedBaggingClassifier(max_features=X.shape[1], n_estimators=500, replacement=True,
sampling_strategy='majority', random_state=42)
res_bbag.fit(X_train, y_train)
pred_res_bbag = res_bbag.predict(X_test)
# Creates a confusion matrix
res_bbag_cm = confusion_matrix(y_test,pred_res_bbag)
# Transform to df for easier plotting
res_bbag_cm_df = pd.DataFrame(res_bbag_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_bbag_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Bagging Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_bbag )*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
print("Resampled Balanced Bagging Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_bbag, X_train, y_train, cv=5)*100)))
print('\n')
#end
end_res_bbag = time.time()
print("\nResampled Balanced Bagging Time: ",end_res_bbag - start_res_bbag)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(y_test,pred_res_bbag).ravel()
accuracy = accuracy_score(y_test,pred_res_bbag)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled Balanced Bagging Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Bagging Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Bagging Classifier Error Rate Score: {0:.2f}%".format(ers))
#Check scores
print("Resampled Balanced Bagging Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(y_test,pred_res_bbag )*100))
print("Resampled Balanced Bagging Classifier F1 Score: {:0.2f}%"
.format(f1_score(y_test, pred_res_bbag,average="macro")*100))
print("Resampled Balanced Bagging Classifier Precision Score: {:0.2f}%"
.format(precision_score(y_test, pred_res_bbag, average="macro")*100))
print("Resampled Balanced Bagging Classifier Recall Score: {:0.2f}%"
.format(recall_score(y_test, pred_res_bbag, average="macro")*100))
print("Resampled Balanced Bagging Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(y_test, pred_res_bbag)*100))
#start
start_res_eec = time.time()
#EasyEnsembleClassifier
res_eec = EasyEnsembleClassifier(n_estimators=500, random_state=42, replacement=True,
sampling_strategy='majority')
res_eec.fit(X_train, y_train)
pred_res_eec = res_eec.predict(X_test)
print("Resampled Balanced Easy Ensemble Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_eec, X_train, y_train, cv=5)*100)))
print('\n')
# Creates a confusion matrix
res_eec_cm = confusion_matrix(y_test,pred_res_eec)
# Transform to df for easier plotting
res_eec_cm_df = pd.DataFrame(res_eec_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_eec_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Easy Ensemble Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_eec )*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
#end
end_res_eec = time.time()
print("\nResampled Balanced Easy Ensemble Time: ",end_res_eec - start_res_eec)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(y_test,pred_res_eec).ravel()
accuracy = accuracy_score(y_test,pred_res_eec)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled Balanced Easy Ensemble Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Easy Ensemble Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Easy Ensemble Classifier Error Rate Score: {0:.2f}%".format(ers))
#Check accuracy
print("Resampled Balanced Easy Ensemble Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(y_test,pred_res_eec )*100))
print("Resampled Balanced Easy Ensemble Classifier F1 Score: {:0.2f}%"
.format(f1_score(y_test, pred_res_eec,average="macro")*100))
print("Resampled Balanced Easy Ensemble Classifier Precision Score: {:0.2f}%"
.format(precision_score(y_test, pred_res_eec, average="macro")*100))
print("Resampled Balanced Easy Ensemble Classifier Recall Score: {:0.2f}%"
.format(recall_score(y_test, pred_res_eec, average="macro")*100))
print("Resampled Balanced Easy Ensemble Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(y_test, pred_res_eec)*100))
#start
start_res_brfc = time.time()
# Balanced Random Forest Classifier
res_brfc = BalancedRandomForestClassifier(criterion='entropy', max_depth=40,
min_samples_leaf = 1, max_features=X.shape[1],
sampling_strategy='majority', replacement=True,
min_samples_split=8, n_estimators=500,
random_state=42)
res_brfc.fit(X_train, y_train)
pred_res_brfc = res_brfc.predict(X_test)
# Creates a confusion matrix
res_brfc_cm = confusion_matrix(y_test,pred_res_brfc)
# Transform to df for easier plotting
res_brfc_cm_df = pd.DataFrame(res_brfc_cm,
index = ['Not Serious','Serious'],
columns = ['Not Serious','Serious'])
plt.figure(figsize=(15,5))
sns.heatmap(res_brfc_cm_df, annot=True, fmt="d", cmap='viridis', linecolor='black', linewidths=1)
plt.title('Resampled Balanced Random Forest Accuracy: {0:.2f}%'.format(accuracy_score(y_test,pred_res_brfc )*100),
fontsize=15)
plt.ylabel('Actual\n')
plt.xlabel('Predicted\n')
plt.show()
print("Resampled Balanced Random Forest Classifier Cross Validation Score: {:0.2f}%"
.format(np.mean(cross_val_score(res_brfc, X_train, y_train, cv=5)*100)))
print('\n')
#end
end_res_brfc = time.time()
print("\nResampled Balanced Random Forest Time: ",end_res_brfc - start_res_brfc)
#extracting true_positives, false_positives, true_negatives, false_negatives
tn, fp, fn, tp = confusion_matrix(y_test,pred_res_brfc).ravel()
accuracy = accuracy_score(y_test,pred_res_brfc)*100
specificity = tn/(tn+fp)*100
fpr = fp/(tn+fp)*100
ers = 100-accuracy
print("Resampled Balanced Random Forest Classifier Specificity Score: {0:.2f}%".format(specificity))
print("Resampled Balanced Random Forest Classifier False Positive Rate Score: {0:.2f}%".format(fpr))
print("Resampled Balanced Random Forest Classifier Error Rate Score: {0:.2f}%".format(ers))
#Check accuracy
print("Resampled Balanced Random Forest Classifier Accuracy Score: {:0.2f}%"
.format(accuracy_score(y_test,pred_res_brfc )*100))
print("Resampled Balanced Random Forest Classifier F1 Score: {:0.2f}%"
.format(f1_score(y_test, pred_res_brfc,average="macro")*100))
print("Resampled Balanced Random Forest Classifier Precision Score: {:0.2f}%"
.format(precision_score(y_test, pred_res_brfc, average="macro")*100))
print("Resampled Balanced Random Forest Classifier Recall Score: {:0.2f}%"
.format(recall_score(y_test, pred_res_brfc, average="macro")*100))
print("Resampled Balanced Random Forest Classifier Roc Auc Score: {0:.2f}%"
.format(roc_auc_score(y_test, pred_res_brfc)*100))
Below, the scores above are compiled into a dataframe and visualized to determine which algorithm is best suited to this data.
#create list of results
results_data={'Learning Algorithm':['Bagging','AdaBoost', 'Random Forest', 'LightGBM','XGBoost',
'Balanced Bagging', 'Easy Ensemble', 'Balanced Random Forest'],
'Accuracy Score':[66.97,66.74,67.09,67.81,66.8,78.53,66.61,67.28],
'F1 Score ':[55.81,54.9,55.87,56.33,55.79,61.97,54.96,56.12],
'Precision Score':[58.1,57.14,58.1,58.27,58.17,60.58,57.27,58.3],
'Recall Score':[67.88,65.58,67.85,68.04,68.1,67.01,65.95,68.28],
'Cross Validation Score':[69.11,65.73,69.15,68.32,69.24,78.47,66.83,67.28],
'Specificity Score':[66.68,67.12,66.84,67.74,66.38,82.21,66.82,66.96],
'Error Rate':[33.03,33.26,32.91,32.19,33.2,17.79,33.39,32.72],
'False Positive Rate':[33.32,32.88,33.16,32.26,33.62,21.47,33.18,33.04],
'Roc Auc Score':[67.88,65.58,67.85,68.04,68.1,67.01,65.95,68.28],
'Time':[5531.351397,389.835886,4370.322077,61.45835494995117,4441.273263931274,
12142.18031,37473.19004,7261.670822],
'Learning Library':['Sklearn', 'Sklearn', 'Sklearn', 'LightGBM', 'XGBoost',
'Imblearn', 'Imblearn', 'Imblearn']}
#create dataframe
results=pd.DataFrame(results_data)
results.head(10)
#change time to minutes
results['Time in Minutes'] = round(results['Time']/60, 2)
#drop actual Time column
results=results.drop('Time',axis=1)
#rearrange columns
results = results[['Learning Algorithm', 'Accuracy Score', 'F1 Score ', 'Precision Score',
'Recall Score', 'Cross Validation Score', 'Specificity Score', 'Error Rate',
'False Positive Rate','Roc Auc Score','Time in Minutes', 'Learning Library']]
results.set_index('Learning Algorithm', inplace=True)
results.head(10)
#csv file for Tableau
results.to_csv('learning_results.csv')
%%HTML
<div class='tableauPlaceholder' id='viz1572177218898' style='position: relative'><noscript><a href='https://github.com/GenTaylor/Traffic-Accident-Analysis'><img alt=' ' src='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsScores/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults/LearningAlgorithmsScores' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsScores/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1572177218898'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
%%HTML
<div class='tableauPlaceholder' id='viz1572079997269' style='position: relative'><noscript><a href='https://github.com/GenTaylor/Traffic-Accident-Analysis'><img alt=' ' src='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsRates/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults/LearningAlgorithmsRates' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsRates/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='useGuest' value='true' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1572079997269'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
%%HTML
<div class='tableauPlaceholder' id='viz1572080028730' style='position: relative'><noscript><a href='https://github.com/GenTaylor/Traffic-Accident-Analysis'><img alt=' ' src='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsTime/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='LearningAlgorithmResults/LearningAlgorithmsTime' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Le/LearningAlgorithmResults/LearningAlgorithmsTime/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='useGuest' value='true' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1572080028730'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
Based on the visualizations above, the Balanced Bagging Classifier from imblearn is the algorithm of choice for this data. While some of the scores were close, the Balanced Bagging Classifier had the highest Accuracy, Cross Validation, and Specificity scores, as well as the lowest Error Rate and False Positive Rate of the group. Its predictions of Serious accidents came close to being unreliable overall, but in the end I was comfortable with the findings.